Re-balance roles deterministically instead of randomly #209
Conversation
Nice refactoring proposal!
LGTM
Thanks Carl! Great work! I have some comments regarding the logic we use to recompute the roles: it is based on the count of only the nodes in the current cluster, whereas it should be based on the entirety of the fleet.
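A tiny illustrative snippet of the distinction being raised; the variable names and values are hypothetical, not taken from the charm:

```python
# Hypothetical example of the reviewer's point: derive the node count from
# every unit in the fleet, not only the nodes currently joined to the cluster.
current_cluster_nodes = ["opensearch/0", "opensearch/2"]        # nodes the cluster currently sees
fleet_units = ["opensearch/0", "opensearch/1", "opensearch/2"]  # all units of the application

node_count_for_roles = len(fleet_units)  # fleet-wide count, as requested in this review
# ...rather than len(current_cluster_nodes), which misses units that are
# temporarily absent (e.g. during a network cut or before joining).
```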
(for auto-generated roles)

Algorithm:
- if odd # of nodes: all nodes are cluster-manager eligible
- if even # of nodes: the node with the highest unit number is a data node; all other nodes are cluster-manager eligible

(a code sketch of this rule follows the description below)

Replaces the old algorithm of `random.choice`-ing one of the nodes that should change roles.

This will enable in-place upgrades to be ordered from highest to lowest unit number even if roles are re-balanced (for HA [e.g. because of a network cut]) while an upgrade is in progress. That allows upgrade/rollback to be coordinated without the leader unit setting information in the peer databag, which allows partial rollback even if the leader unit is in an error state.

---

Also, this algorithm ensures that if a unit needs to be restarted for HA while an upgrade is in progress, a unit (the highest-numbered unit) on the newer version of the charm/workload will be upgraded. This reduces the likelihood that:
- one of the last units on the older version restarts, and
- the cluster manager switches to operating with the new workload version (i.e. it stops operating in compatibility mode with the old version; see https://discuss.elastic.co/t/rolling-upgrades-master-nodes-voting-config-exclusions/320463/2)

which means that units on the old version cannot re-connect to the cluster.
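A minimal sketch of the algorithm described above, assuming a hypothetical helper that receives the unit numbers of every node to consider; the exact role strings, and whether cluster-manager-eligible nodes also carry the data role, are assumptions for illustration, not the charm's actual API:

```python
def rebalance_roles(unit_numbers: list[int]) -> dict[int, list[str]]:
    """Deterministic role assignment (sketch).

    Odd number of nodes: every node is cluster-manager eligible.
    Even number of nodes: the node with the highest unit number becomes a
    data-only node; all other nodes are cluster-manager eligible.
    """
    if not unit_numbers:
        raise ValueError("at least one unit number is required")
    if len(unit_numbers) % 2 == 1:
        # Odd: all nodes are cluster-manager eligible.
        return {unit: ["cluster_manager", "data"] for unit in unit_numbers}
    # Even: the highest-numbered unit is the data node, the rest are
    # cluster-manager eligible.
    highest = max(unit_numbers)
    return {
        unit: ["data"] if unit == highest else ["cluster_manager", "data"]
        for unit in unit_numbers
    }


# Example: with 4 units, unit 3 is data-only and units 0-2 are
# cluster-manager eligible; with 3 units, all are cluster-manager eligible.
assert rebalance_roles([0, 1, 2, 3])[3] == ["data"]
assert rebalance_roles([0, 1, 2])[2] == ["cluster_manager", "data"]
```

Because the assignment depends only on the set of unit numbers, every unit can compute the same result independently, which is what removes the need for the leader to record the outcome of a `random.choice` in the peer databag.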
Highest unit number cannot always be determined (e.g. if unit hasn't joined peer relation). Unable to use `planned_units` since unit numbers are not necessarily sequential
`max([])` raises `ValueError`
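A hedged sketch of how the highest unit number might be determined given the caveats above; the peer-relation unit names are a hypothetical input, and `max(..., default=None)` is used so that an empty list does not raise the `ValueError` that `max([])` would:

```python
def highest_unit_number(peer_unit_names: list[str]) -> int | None:
    """Return the highest unit number among peer-relation units, or None.

    Unit numbers are parsed from names like "opensearch/3", so no assumption
    is made that they are sequential (which is why `planned_units` alone is
    not enough).  An empty list (e.g. this unit has not joined the peer
    relation yet) returns None instead of raising ValueError.
    """
    numbers = (int(name.split("/")[1]) for name in peer_unit_names)
    return max(numbers, default=None)


# Examples:
assert highest_unit_number(["opensearch/0", "opensearch/3"]) == 3
assert highest_unit_number([]) is None
```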
Thanks Carl!
Context:
https://chat.canonical.com/canonical/pl/zpqoyei54ffr5kwgwkbimizhwo
https://warthogs.atlassian.net/browse/DPE-3878